ABSTRACT

In the past decade a number of fixed sampling methods have
been developed for selecting the "best" or at least a "good" subset
of variables in regression analysis. We are interested in
deriving a sequential selection procedure to select a subset of
random size including all good regression equations. Tables for
an example are given at the end of this paper.


1.  INTRODUCTION 

In the past decade a number of fixed sampling methods have
been developed for selecting the "best" or at least a "good" subset
of variables in regression analysis (see e.g. Arvesen and
McCabe (1975) and Spjøtvoll (1972)). In this paper, we are
interested in deriving a sequential selection procedure to select a
random size subset including all "good" regression equations.


Tables  for  an  example  are  given  at  the  end  of  this  paper. 

2. SEQUENTIAL SUBSET SELECTION PROCEDURE

Before discussing the regression problem, we develop general
results applicable to the selection of "good" or "superior"
populations defined later.

Let $\pi_0, \pi_1, \ldots, \pi_k$ denote $k+1$ normal populations with unknown
means $\mu_0, \mu_1, \ldots, \mu_k$ and variances $\sigma_0^2, \sigma_1^2, \ldots, \sigma_k^2$.
Assume that $\sigma_0^2$ is known but $\sigma_i^2$ $(1 \le i \le k)$ are unknown. Let the ranked
values of $\sigma_i^2$ $(1 \le i \le k)$ be denoted by $\sigma_{[1]}^2 \le \cdots \le \sigma_{[k]}^2$. We wish to derive a
method to construct a sequential procedure to select a subset
containing all "superior" populations, that is, the populations with smaller
variances, with a probability not less than $P^*$ $(0 < P^* < 1)$, a
specified constant. We assume that $\sigma_0^2 = 1$.

Let $X_{in}$ denote the $n$th observation from population $\pi_i$. It is
assumed that the observations $X_{i1}, \ldots, X_{in}$ are independent random
variables. Define

$$\bar{X}_{in} = \frac{1}{n} \sum_{j=1}^{n} X_{ij}, \qquad s_{in}^2 = n^{-1} \sum_{j=1}^{n} (X_{ij} - \bar{X}_{in})^2 .$$

The selection procedure will depend upon $\{s_{in}^2\}$, which is a sufficient
and transitive sequence and also invariantly sufficient for $\sigma_i^2$.

A population $\pi_i$ is said to be "superior" (or "good") if
$\sigma_i^2 \le \Delta$, and to be "inferior" (or "bad") if $\sigma_i^2 > \Delta$, where $\Delta$ is a
specified constant greater than 1. Let $\Omega$ be the parameter space, which
is the collection of all possible parameter vectors $\theta = (\sigma_1^2, \ldots, \sigma_k^2)$.
Let $t$ denote the unknown number of inferior
populations in the given collection of $k$ populations. We have
$0 \le t \le k$. Let

$$\Omega_t = \{\theta : \sigma_{[1]}^2 \le \cdots \le \sigma_{[k-t]}^2 \le \Delta < \sigma_{[k-t+1]}^2 \le \cdots \le \sigma_{[k]}^2\}.$$

Then $\Omega = \bigcup_{t=0}^{k} \Omega_t$.

For the subset selection procedure $R$, two constants $\Delta$ and $P^*$
with $\Delta > 1$, $1 > P^* > 0$, are specified and we wish to select a
subset containing all superior populations with a probability of at
least $P^*$. When all the superior populations are contained in the
selected subset, we say a correct decision (CD) has been made.

Thus we require a procedure for which

$$P_\theta(CD \mid R) \ge P^*$$

for all $\theta \in \Omega$.

Let $g_{\sigma_i^2}(s_{in}^2)$ denote the probability density of $s_{in}^2$
depending on the parameter $\sigma_i^2$. We define the log-likelihood ratios

$$\ell_n(s_{in}^2) = \log g_\Delta(s_{in}^2) - \log g_1(s_{in}^2) \qquad (2.1)$$

upon which the procedure is based.

Elimination type sequential selection procedure R for selecting
the superior populations.

Begin by taking $n_1$ $(\ge 1)$ independent observations from each
of the $k$ populations. Calculate the values of the $k$ log-likelihood
ratios $\ell_{n_1}(s_{i n_1}^2)$, $1 \le i \le k$. For any $i$, if

$$\ell_{n_1}(s_{i n_1}^2) \ge a,$$

where $a = \log\{k(k+1)/[2(1-P^*)]\}$, we eliminate the population $\pi_i$ from
further consideration. We proceed to the next (second) stage by
taking $n_2 - n_1$ independent observations on each of the remaining
populations. The log-likelihood ratios for the contending
populations are again computed and the same elimination rule is used
except that $\ell_{n_2}(s_{i n_2}^2)$ everywhere replaces $\ell_{n_1}(s_{i n_1}^2)$. We continue
in this manner until the elimination is stopped, at which time the
procedure is terminated with the declaration that the remaining
populations are the superior populations. If after applying this
rule at the $s$th stage (say), the number of remaining populations
is zero, then we select the population $\pi_0$, which is the control
population.
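The elimination rule above can be sketched in Python. This is a minimal illustration rather than the authors' implementation: the function names, the `draw` callback and the `stage_sizes` argument are hypothetical, and the log-likelihood ratio is written in terms of the corrected sum of squares, assuming only that this sum divided by the variance is chi-square with $n - 1$ degrees of freedom.

```python
import math

def threshold(k, p_star):
    """Elimination threshold a = log(k(k+1) / (2(1 - P*)))."""
    return math.log(k * (k + 1) / (2.0 * (1.0 - p_star)))

def sum_of_squares(xs):
    """Corrected sum of squares: sum_j (x_j - mean)^2."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def log_likelihood_ratio(sum_sq, n, delta):
    """log g_Delta - log g_1 for a normal sample of size n.

    Assumption: sum_sq / sigma^2 is chi-square with n - 1 degrees of freedom.
    """
    return -0.5 * (n - 1) * math.log(delta) + 0.5 * sum_sq * (1.0 - 1.0 / delta)

def procedure_r(draw, k, delta, p_star, stage_sizes):
    """Run the elimination procedure over the stages n_1 < n_2 < ...

    draw(i) returns one new observation from population i (1..k).
    Returns the retained populations; [0] denotes the control population.
    """
    a = threshold(k, p_star)
    samples = {i: [] for i in range(1, k + 1)}
    for n in stage_sizes:
        # Top up every remaining population to n observations.
        for i in samples:
            while len(samples[i]) < n:
                samples[i].append(draw(i))
        rejected = [i for i, xs in samples.items()
                    if log_likelihood_ratio(sum_of_squares(xs), n, delta) >= a]
        if not rejected:
            break  # elimination has stopped
        for i in rejected:
            del samples[i]
        if not samples:
            return [0]  # nothing left: select the control population
    return sorted(samples)
```

A caller supplies the sampling mechanism through `draw`, so the same sketch works with simulated or recorded data.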

Note that $n_1$ is the sample size of that stage of the procedure
at which a decision may be made, for the first time, to
reject one or more populations. Let $n_2 > n_1$ be the sample size of
the next stage of the procedure at which such a decision may be
made, and in general let $n_s > n_{s-1}$ be the sample size of the stage
of the procedure at which the $s$th decision to reject one or more
populations may be made. Let $N$ be the stage at which the procedure
terminates. It is clear that if there are $k$ populations to
start with, then $N \le n_k$ (see Gupta and Huang (1975)).

We assume that

$$P_{\sigma_i^2}\{\ell_n(s_{in}^2) \ge a \text{ for some } n\} \qquad (2.2)$$

is a nondecreasing function of $\sigma_i^2$. A sufficient condition for
this is discussed by Hoel (1970). Without loss of generality, we
assume that $\pi_1, \ldots, \pi_{k-t}$ are the superior populations. Since the
procedure $R$ is truncated, we have

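$$
\begin{aligned}
1 - P(CD \mid R) &= P\{\ell_n(s_{in}^2) \ge a \text{ for some } i = 1, \ldots, k-t, \text{ for some } t,\ 0 \le t \le k, \text{ for some } n\} \\
&\le \sum_{t=0}^{k} \sum_{i=1}^{k-t} P_{\sigma_i^2}\{\ell_n(s_{in}^2) \ge a \text{ for some } n\} \\
&\le \sum_{t=0}^{k} \sum_{i=1}^{k-t} P_{\Delta}\{\ell_n(s_{in}^2) \ge a \text{ for some } n\} \\
&\le \sum_{t=0}^{k} (k-t)\, e^{-a} = \tfrac{1}{2} k(k+1)\, e^{-a} = 1 - P^*.
\end{aligned}
$$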
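The last equality in this chain is simple arithmetic and can be checked numerically. The short Python sketch below (the function name is hypothetical) evaluates $\sum_{t=0}^{k} (k-t) e^{-a}$ with $a = \log\{k(k+1)/[2(1-P^*)]\}$ and should return $1 - P^*$ up to rounding.

```python
import math

def error_bound(k, p_star):
    """Sum over t = 0..k of (k - t) * exp(-a), with a as in procedure R."""
    a = math.log(k * (k + 1) / (2.0 * (1.0 - p_star)))
    return sum((k - t) * math.exp(-a) for t in range(k + 1))
```

For instance, `error_bound(14, 0.9)` evaluates the bound for the 14 candidate equations of the example in Section 5.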


3. APPLICATIONS TO SELECTION OF "GOOD" OR "SUPERIOR" REGRESSION EQUATIONS


Assume the following standard linear model,

$$Y = X\beta + \varepsilon \qquad (3.1)$$

where $X$ is an $n \times p$ known matrix of rank $p \le n$, $\beta$ is a $p \times 1$ parameter
vector, and $\varepsilon \sim N(0, \sigma_0^2 I_n)$. Consider the models for any $r$,
$2 \le r \le p-1$,

$$Y = X_{ri}\beta_{ri} + \varepsilon_{ri} \qquad (3.2)$$

where $X_{ri}$ is an $n \times r$ matrix of rank $r$ with $X_1 = [1, \ldots, 1]'$,
$\beta_{ri}$ is an $r \times 1$ parameter vector, and $\varepsilon_{ri} \sim N(0, \sigma_{ri}^2 I_n)$, where
$i = 1, \ldots, k_r = \binom{p-1}{r-1}$. Let $k = \sum_{r=2}^{p-1} k_r$. The goal is to include all the designs $X_{ri}$
(or sets of independent variables) associated with $\sigma_{[j]}^2$, $j = 1, \ldots, k-t$.
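As a quick check of the counting, the number of candidate designs of each size can be tabulated in Python. This is a small sketch with a hypothetical function name; it assumes, as in the model above, that every candidate design contains the constant column and has between 2 and $p-1$ columns.

```python
from math import comb

def candidate_counts(p):
    """Return {r: k_r} for 2 <= r <= p - 1, with k_r = C(p - 1, r - 1)."""
    return {r: comb(p - 1, r - 1) for r in range(2, p)}

counts = candidate_counts(5)   # sizes r = 2, 3, 4
k = sum(counts.values())       # total number of candidate equations
```

With $p = 5$ this gives $k = 4 + 6 + 4 = 14 = 2^{4} - 2$.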

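Note that for any $r$, $2 \le r \le p-1$, if

$$SS_{ri} = Y'[I - X_{ri}(X_{ri}'X_{ri})^{-1}X_{ri}']Y = Y'Q_{ri}Y,$$

then following Searle (1972, p. 57)

$$SS_{ri}/\sigma_0^2 \sim \chi^2\{\nu_r,\ (X\beta)'Q_{ri}(X\beta)/(2\sigma_0^2)\},$$

where $\nu_r = n-r$, for $1 \le i \le k_r$. Note that the noncentrality
parameter is not zero in general and that

$$\sigma_{ri}^2 = \sigma_0^2 + (X\beta)'Q_{ri}(X\beta)/\nu_r.$$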

If $\sigma_0^2$ is not equal to 1, then we consider the linear model
$Y/\sigma_0 = X\beta/\sigma_0 + \varepsilon'$, $\varepsilon' \sim N(0, I_n)$. Thus we assume without loss of
generality that $\sigma_0^2 = 1$.

We know that the noncentral $\chi^2(\nu, \lambda)$ density with noncentrality
parameter $\lambda$ has monotone likelihood ratio in its argument $x$. Hence the
monotonicity of (2.2) is satisfied. We can apply the sequential
procedure $R$ to select superior regression equations by replacing $s_{in}^2$
by $SS_{ri}/\nu_r$.
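To make the statistic concrete, the following Python sketch computes the residual sum of squares $SS_{ri} = Y'Q_{ri}Y$ and the ratio $SS_{ri}/\nu_r$ for one candidate design. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the design is passed as a list of columns, and the projection is carried out by modified Gram-Schmidt instead of forming $Q_{ri}$ explicitly.

```python
def residual_sum_of_squares(columns, y):
    """Return SS = y'Qy, the residual sum of squares of y on the given columns.

    columns is a list of r linearly independent columns, each of length n.
    Modified Gram-Schmidt removes the projection of y onto their span.
    """
    basis = []
    resid = list(y)
    for col in columns:
        v = list(col)
        # Orthogonalize the new column against the basis built so far.
        for q in basis:
            c = sum(a * b for a, b in zip(q, v))
            v = [a - c * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm < 1e-12:
            raise ValueError("columns are linearly dependent")
        q = [a / norm for a in v]
        basis.append(q)
        # Remove the component of the residual along the new direction.
        c = sum(a * b for a, b in zip(q, resid))
        resid = [a - c * b for a, b in zip(resid, q)]
    return sum(a * a for a in resid)

def u_statistic(columns, y):
    """SS / nu with nu = n - r, where r is the number of columns."""
    n, r = len(y), len(columns)
    return residual_sum_of_squares(columns, y) / (n - r)
```

For a design consisting of the constant column and one regressor, `u_statistic([[1, 1, 1], [0, 1, 2]], [1, 2, 4])` returns the usual residual mean square of the fitted line.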


4. COMPUTATION OF (2.1)


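Let $U_{ri} = SS_{ri}/\nu_r$. The probability density of $U_{ri}$ is

$$g_{\sigma_{ri}^2}(U_{ri}) = \nu_r e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k (\nu_r U_{ri})^{\frac{1}{2}\nu_r + k - 1} e^{-\frac{1}{2}\nu_r U_{ri}}}{k!\, 2^{\frac{1}{2}\nu_r + k}\, \Gamma(\frac{1}{2}\nu_r + k)},$$

where $\sigma_{ri}^2 = 1 + (X\beta)'Q_{ri}(X\beta)/\nu_r$, $\nu_r = n-r$ and $\lambda = (X\beta)'Q_{ri}(X\beta)/2$.
If $\sigma_{ri}^2 = 1$, then $\lambda = 0$ and if $\sigma_{ri}^2 = \Delta$, then $\lambda = (\Delta-1)\nu_r/2$. Hence

$$\frac{g_\Delta(U_{ri})}{g_1(U_{ri})} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \left(\frac{\nu_r U_{ri}}{2}\right)^{k} \frac{\Gamma(\frac{1}{2}\nu_r)}{\Gamma(\frac{1}{2}\nu_r + k)}, \qquad (4.1)$$

where $\lambda = (\Delta-1)\nu_r/2$. Let

$$a_k = e^{-\lambda}\, \frac{\lambda^k}{k!} \left(\frac{\nu_r U_{ri}}{2}\right)^{k} \frac{\Gamma(\frac{1}{2}\nu_r)}{\Gamma(\frac{1}{2}\nu_r + k)}, \qquad k = 0, 1, 2, \ldots.$$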


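Since

$$\frac{a_{k+1}}{a_k} = \frac{\lambda \nu_r U_{ri}}{2} \cdot \frac{1}{(k+1)(\frac{1}{2}\nu_r + k)} \to 0 \quad \text{as } k \to \infty,$$

then for any $0 < \delta < 1$, there exists $q$ such that

$$\frac{a_{k+1}}{a_k} < \delta < 1 \quad \text{for all } k \ge q.$$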

Let us consider the error due to the truncation of the series
in (4.1). Let $q$ be the number of terms in the truncated series.
Then the error due to truncation of the series in (4.1) satisfies

$$\sum_{k=0}^{\infty} a_{q+k} \le \frac{a_q}{1-\delta}.$$

Given $\eta > 0$, let $k_0$ be the smallest positive integer $k$ such that

$$\frac{a_k}{\eta} < 1 \quad \text{and} \quad \frac{a_{k+1}}{a_k} + \frac{a_k}{\eta} < 1.$$


For this $k_0$, it is easy to prove that

$$0 < \frac{g_\Delta(U_{ri})}{g_1(U_{ri})} - \sum_{k=0}^{k_0-1} a_k = \sum_{k=0}^{\infty} a_{k_0+k} < \eta.$$

Thus $g_\Delta(U_{ri})/g_1(U_{ri}) \approx \sum_{k=0}^{k_0-1} a_k$ with error less than $\eta$. Evaluating
$g_\Delta(U_{ri})/g_1(U_{ri})$ in this way is computationally very efficient.
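The truncation rule can be written as a short loop. The Python sketch below is one possible reading of this section, with hypothetical names: it takes each term $a_k$ to include the factor $e^{-\lambda}$, updates terms through the ratio $a_{k+1}/a_k$, and stops at the smallest positive $k$ that satisfies both stopping conditions.

```python
import math

def likelihood_ratio(u, nu, delta, eta=0.1):
    """Approximate g_Delta(u) / g_1(u) from (4.1) to within eta.

    Terms a_k (taken to include the factor exp(-lambda)) are summed for
    k = 0, ..., k0 - 1, where k0 is the smallest positive integer with
    a_k / eta < 1 and a_{k+1} / a_k + a_k / eta < 1.
    Returns the pair (approximation, k0).
    """
    lam = (delta - 1.0) * nu / 2.0
    x = nu * u / 2.0
    a_k = math.exp(-lam)  # the k = 0 term
    total = 0.0
    k = 0
    while True:
        ratio = lam * x / ((k + 1) * (nu / 2.0 + k))  # a_{k+1} / a_k
        if k >= 1 and a_k / eta < 1.0 and ratio + a_k / eta < 1.0:
            return total, k
        total += a_k
        a_k *= ratio
        k += 1
```

Because successive ratios decrease in $k$, the loop always terminates, and working with ratios avoids evaluating gamma functions directly.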


5.  EXAMPLE 

In this section we present an example which will serve to
illustrate the sequential subset selection procedure. The data
set is taken from Neter and Wasserman (1974, p. 373), who used it
to illustrate several methods of finding a "best" set of
independent variables.

There are $n = 55$ observations on $p = 5$ independent variables.
Then $k = 2^4 - 2 = 14$. For the subset selection procedure $R$, two
constants $\Delta$ and $P^*$ with $\Delta > 1$, $1 > P^* > 0$, are specified and we
wish to select a subset containing all superior regression
equations with probability at least $P^*$.

Begin by taking $n_1$ $(\ge 5)$ independent observations. Calculate
the values of the $k$ ratios $g_\Delta(U_{ri})/g_1(U_{ri})$ with error $\eta$ (specified).
For any $r$, $i$, if

$$g_\Delta(U_{ri})/g_1(U_{ri}) \ge b,$$

where $b = k(k+1)/[2(1-P^*)]$, we eliminate the regression equation
from further consideration. We proceed to the second stage by
taking $n_2 - n_1$ independent observations on each of the remaining
regression equations. The ratios are again computed and the same
elimination rule is used. We continue in this manner until the
elimination is stopped, at which time the procedure is terminated
with the declaration that the remaining regression equations are
the superior regression equations.

Let $\eta = 0.1$. For the value of $g_\Delta(U_{ri})/g_1(U_{ri})$, this error of
$\eta = 0.1$ is small enough with respect to the constant $b$. Tables II-VII
list the subsets of independent variables eliminated at each stage of the
sequential subset selection procedure $R$.
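To see why an error of 0.1 is small relative to $b$, the cutoff can be evaluated for the three values of $P^*$ used in the tables. The Python sketch below uses a hypothetical function name and takes $k = 14$ from this example.

```python
def ratio_threshold(k, p_star):
    """b = k(k+1) / (2(1 - P*)), the cutoff for g_Delta / g_1."""
    return k * (k + 1) / (2.0 * (1.0 - p_star))

# Cutoffs for the k = 14 candidate equations of the example.
thresholds = {p: ratio_threshold(14, p) for p in (0.7, 0.8, 0.9)}
```

The cutoffs are about 350, 525 and 1050, so an absolute error of 0.1 in the ratio is negligible by comparison.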

In Table III, we consider $\Delta = 1.2$. If $P^* = 0.9$, then the
procedure $R$ eliminates $(X_1, X_3)$, $(X_1, X_4)$ and $(X_1, X_5)$ at stage 1
($n_1 = 11$), eliminates $(X_1, X_4, X_5)$ at stage 2 ($n_2 = 16$) and
eliminates $(X_1, X_3, X_4)$ at stage 3 ($n_3 = 21$). No subset is eliminated
at stage 4 ($n_4 = 26$). Thus the procedure is terminated. $(X_1, X_2)$,
$(X_1, X_2, X_3)$, $(X_1, X_2, X_4)$, $(X_1, X_2, X_5)$, $(X_1, X_3, X_5)$,
$(X_1, X_2, X_3, X_4)$, $(X_1, X_2, X_3, X_5)$, $(X_1, X_2, X_4, X_5)$ and $(X_1, X_3, X_4, X_5)$
are the sets of variables of the superior regression equations. We can
use the $C_p$ statistic to select one good regression equation from
the set of superior regression equations. For this example,
$(X_1, X_2, X_4)$ is the set of variables of a good regression equation
(cf. Neter and Wasserman (1974)). Tables II-VII present the
results for $\Delta$ = 1.1, 1.2, 1.5, 2, 3 and 5; $P^*$ = 0.7, 0.8 and 0.9.
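The bookkeeping for the $\Delta = 1.2$, $P^* = 0.9$ run can be reproduced in a few lines of Python. This is only a sketch: the tuple labels are a hypothetical encoding in which `(1, 2, 4)` stands for $(X_1, X_2, X_4)$, and the subscripts of the eliminated subsets are taken as read from the text of this example.

```python
from itertools import combinations

# All candidate subsets contain X1; sizes r = 2, 3, 4 out of p = 5 columns.
candidates = [(1,) + rest
              for size in (1, 2, 3)
              for rest in combinations((2, 3, 4, 5), size)]

# Eliminations reported for Delta = 1.2, P* = 0.9, keyed by sample size n.
eliminated_by_stage = {
    11: [(1, 3), (1, 4), (1, 5)],
    16: [(1, 4, 5)],
    21: [(1, 3, 4)],
    26: [],  # no rejection, so the procedure stops here
}

eliminated = {s for stage in eliminated_by_stage.values() for s in stage}
superior = [s for s in candidates if s not in eliminated]
```

Removing the five eliminated subsets from the 14 candidates leaves the nine superior subsets listed in the text.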

TABLE II

$\eta = 0.1$, $\Delta = 1.1$.


p* 

n 

16 

21 

26 

31 

0.7 

(1.477(1,30 

(1,5) 

0,4, 50,  (1,3.4) 

no  rejection 

— - 

0.8 

(TOO  ,Ti  ,3) 

71,4.50707X00 

0,4) 

no  rejection 

— 

0.9 

0.50. (1,30 

. _ _ 

o  ,4,5i,Ti  .or 

0,3,4) 

no  rejection 

TABLE III

$\eta = 0.1$, $\Delta = 1.2$.


11 


0  , 4 )  ,Jl , 3 ) 
775770,4) 
1U3JL_. 


16 

21 

26 

no  rejection 

— 

5) 

no  rejection 

( 1T475) 

7,3/4 T~ 

no  rejection 

TABLE IV

$\eta = 0.1$, $\Delta = 1.5$.


6 

11 

16 

(1,47,0,37 

TOOT 

(1,4,5T,0757 

0,3,4) 

no  rejection 

71,4757,737 

0,3, 4),  (1,4) 

no  rejection 

( 1 ,37 

0,4,57,7777 

0,3, 4), (1,4) 

no  rejection 

TABLE V

$\eta = 0.1$, $\Delta = 2$.

6 

11 

16 

71,4,5")  ,  (1,57 
0,4),  1,3) 

0,3,77 

no  rejection 

"0757,71,47  ■ 
p,3) 

(1,4,57,773,47 

no  rejection 

0,5770,4) 

(1,3) 

(1,  4 ,57,777,7 

no  rejection 

TABLE VI

$\eta = 0.1$, $\Delta = 3$.


n 

P* 

6 

11 

16 

0:7 

( 1 » 4 ,5 )",  ( 1  ,  5^TH  ,  3 ,4") 
CM),  1,3) 

no  rejection 

0.8* . 

( 1,4,5  i,n.5i 

1L.4),  1,3).  ...  . 

no  rejection 

0.9 . 

(t!4,5T,tl,5) 

(1,4), (1,3) 

(173, 4T 

no  rejection 

TABLE VII

$\eta = 0.1$, $\Delta = 5$.